Defining Optimality in Statistical Inference
MATH003 Lesson 8
00:00
In the vast wilderness of statistical data, we are hunters seeking the truth—the true parameter $\psi(\theta)$. But how do we decide which arrow (estimator) is best? Optimality is not a vague feeling; it is the mathematical art of minimizing loss. To find the 'best' estimator, we look to the Mean Squared Error (MSE), which elegantly decomposes into the tension between two fundamental forces: Variance and Bias.

Defining the Gold Standard: MSE

To quantify how far our guess $T$ is from the reality $\psi(\theta)$, we define the Mean Squared Error (Definition 6.3.1):

$$MSE_\theta(T) = E_\theta((T - \psi(\theta))^2)$$

This is the average squared distance between our estimator and the target. A perfect estimator would have an MSE of zero, but in a world of random noise, we strive to minimize it.
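The definition can be made concrete with a small Monte Carlo sketch. Here we estimate the MSE of the sample mean for a normal sample; the model (normal with unit variance), sample size, and replication count are illustrative assumptions, not part of the lesson.

```python
import random
import statistics

# Monte Carlo estimate of MSE_theta(T) = E_theta((T - psi(theta))^2)
# for the sample mean T = xbar targeting psi(theta) = theta.
# Model and constants below are hypothetical choices for illustration.
random.seed(0)
theta = 2.0          # true parameter (the target psi(theta))
n, reps = 10, 5000   # sample size and simulation replications

errors = []
for _ in range(reps):
    sample = [random.gauss(theta, 1.0) for _ in range(n)]
    T = statistics.mean(sample)        # the estimator
    errors.append((T - theta) ** 2)    # squared distance to the target

mse_hat = statistics.mean(errors)
print(round(mse_hat, 3))
```

For an unbiased sample mean of $n$ unit-variance observations the theoretical MSE is $1/n = 0.1$, and the simulated value lands close to it.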

Theorem 8.1.1: The Architecture of Error

Why does an estimator fail? Theorem 8.1.1 provides the blueprint. If $T$ has a finite second moment, the error relative to any constant $c$ is given by:

$$E((T - c)^2) = \text{Var}(T) + (E(T) - c)^2$$

This formula reveals that the total squared error is minimized only when we choose $c = E(T)$. In the context of inference, we set $c = \psi(\theta)$, leading to the famous decomposition:

$$MSE_\theta(T) = \text{Var}_\theta(T) + (E_\theta(T) - \psi(\theta))^2 = \text{Variance} + \text{Bias}^2$$

where the bias $E_\theta(T) - \psi(\theta)$ measures systematic over- or under-estimation.
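Because the theorem is an algebraic identity, it can be checked exactly on a small discrete distribution. The values and probabilities below are hypothetical; the check confirms that $E((T-c)^2) = \text{Var}(T) + (E(T)-c)^2$ for every $c$, and that the error is smallest at $c = E(T)$.

```python
# Exact check of Theorem 8.1.1 on a small discrete distribution
# (the values and probabilities are hypothetical illustrations).
values = [1.0, 2.0, 4.0]
probs  = [0.2, 0.5, 0.3]

def E(f):
    # expectation of f(T) under the discrete law above
    return sum(p * f(v) for v, p in zip(values, probs))

mean = E(lambda t: t)                     # E(T)
var = E(lambda t: (t - mean) ** 2)        # Var(T)

for c in (0.0, 1.5, 3.0, mean):
    lhs = E(lambda t: (t - c) ** 2)       # total squared error about c
    rhs = var + (mean - c) ** 2           # Var + squared deviation of E(T) from c
    assert abs(lhs - rhs) < 1e-12         # the identity holds for every c

print("minimized at c = E(T) =", mean)
```

Setting $c = \psi(\theta)$ turns the identity into the variance-bias decomposition of the MSE.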

The Precision-Accuracy Tradeoff

Imagine two weighing scales in a quality control lab:

  • The Precise Relic: It gives the same weight every time (low Variance) but is miscalibrated by 2 grams (high Bias).
  • The Erratic Sage: It is correct on average (zero Bias) but oscillates wildly between measurements (high Variance).

Theorem 8.1.1 allows us to calculate exactly which scale provides the lower total error. Often, we are willing to accept a small amount of systematic deviation (Bias) if it drastically reduces the noise (Variance).
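The two scales can be compared directly with the decomposition; no simulation is needed. The bias and standard deviation figures below are illustrative assumptions consistent with the story above.

```python
# Comparing the two hypothetical scales via Theorem 8.1.1:
# total error = Variance + Bias^2.
precise_relic = {"bias": 2.0, "sd": 0.1}   # miscalibrated but steady
erratic_sage  = {"bias": 0.0, "sd": 3.0}   # unbiased but noisy

def mse(scale):
    # MSE = Var + Bias^2 (Theorem 8.1.1 with c = psi(theta))
    return scale["sd"] ** 2 + scale["bias"] ** 2

print(round(mse(precise_relic), 2))  # 4.01
print(round(mse(erratic_sage), 2))   # 9.0
```

With these numbers the biased but precise scale wins: its 2-gram systematic error costs less than the unbiased scale's wild oscillation.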

Example 8.1.1: Sufficiency and Information

Optimality is tied to information. Consider a sample space $S = \{1, 2, 3, 4\}$. If outcomes 2, 3, and 4 are equally likely under every possible parameter, they carry the same likelihood. We can therefore define a sufficient statistic $U$ that groups these outcomes together without losing any ability to make an optimal inference: since $L(\cdot|2) = L(\cdot|3) = L(\cdot|4)$, an optimal estimator treats these three outcomes as a single informative event.
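A minimal sketch of this grouping, assuming a hypothetical model in which $P_\theta(1) = \theta$ and the remaining probability $1 - \theta$ is split evenly over 2, 3, and 4 (so those three outcomes are equally likely under every $\theta$):

```python
# Hypothetical model on S = {1, 2, 3, 4}: P(1) = theta,
# P(s) = (1 - theta) / 3 for s in {2, 3, 4}.
def likelihood(theta, s):
    return theta if s == 1 else (1.0 - theta) / 3.0

# Sufficient statistic U groups the equally likely outcomes together.
def U(s):
    return 0 if s == 1 else 1

# L(theta | 2) = L(theta | 3) = L(theta | 4) for every theta, so observing
# U(s) = 1 conveys exactly as much about theta as observing s itself.
for theta in (0.1, 0.5, 0.9):
    assert likelihood(theta, 2) == likelihood(theta, 3) == likelihood(theta, 4)

print("likelihood depends on s only through U(s)")
```

Any estimator that distinguished between outcomes 2, 3, and 4 would be reacting to noise rather than information, so an optimal estimator is a function of $U$ alone.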

🎯 Core Principle
An estimator is optimal when it minimizes the expected loss. For squared error loss, this means finding the point where the sum of Variance and Bias² is at its absolute minimum.